ABSTRACT

Motivation: The recent outbreak of severe acute respiratory
syndrome (SARS), caused by the SARS coronavirus (SARS-CoV),
has necessitated an in-depth molecular understanding of the
virus to identify new drug targets. The availability of
complete genome sequences for several strains of the SARS
virus makes it possible to identify protein-coding genes and
define their functions. A computational approach to
identifying protein-coding genes and their putative functions
will help in designing experimental protocols.

Results: In this paper, we present a novel analysis of the SARS
genome using GeneDecipher, a gene prediction method developed
in our laboratory. Each of the 18 newly sequenced SARS-CoV
genomes has been analyzed using GeneDecipher. In addition to
polyprotein 1ab¹, polyprotein 1a and the four genes coding for
the major structural proteins spike (S), small envelope (E),
membrane (M) and nucleocapsid (N), six to eight additional
proteins have been predicted, depending upon the strain
analyzed. Their lengths range between 61 and 274 amino acids.
Our method also suggests that polyprotein 1ab, polyprotein 1a,
S, M and N are proteins of viral origin and that the others are
of prokaryotic origin. Putative functions of all predicted
protein-coding genes have been suggested using conserved
peptides present in their open reading frames.

Availability: Detailed results of GeneDecipher analysis of
all 18 SARS-CoV genomes are available at
http://www.igib.res.in/sarsanalysis.html

Contact: skb@igib.res.in 


*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two authors
should be regarded as joint First Authors.

¹GeneDecipher predicts polyprotein 1ab (265..21485) in two fragments
(265..13413) and (13599..21485) because there is a stop codon at location
13413. These locations are given with respect to the NCBI refseq genome
sequence.
INTRODUCTION


Severe acute respiratory syndrome (SARS) has emerged as 
a life-threatening disease. Early reports on SARS appeared 
from China (Ksiazek et al., 2003) and subsequently, cases 
of SARS were reported from Taiwan, Vietnam, Canada, 
Singapore and other countries. Symptoms observed in
SARS-affected patients include fever, dry cough,
dyspnea, headache and hypoxemia. Typical laboratory
findings include lymphopenia and mildly elevated
aminotransferase levels. Death may result from progressive
respiratory failure due to alveolar damage (Tsang et al., 2003).
On average, the mortality rate was 4%, though it varied
widely according to the geographic location (WHO Report, 
2003, http://www.who.int/csr/sarscountry/2003_04_04/en/) 
and with the strain implicated. SARS isolates from different 
parts of the world have been sequenced recently. Sequence 
analysis of nucleic acid fragments isolated from cytopathic 
Vero cell cultures showed that the encoded protein sequences 
were similar to proteins of other coronaviruses (Drosten 
et al., 2003, www.nejm.org). However, at the nucleic acid 
level, no similarity was observed with any sequence in the 
database, indicating substantial diversity. Phylogenetic
analysis showed that the isolated sequence is distinct and is
placed between group 2 and group 3 coronaviruses in the tree
(Marra et al., 2003).

Current computational methods such as GeneMark.hmm
(Lukashin and Borodovsky, 1998) and Glimmer (Salzberg et al.,
1998) face difficulty in analyzing the SARS genome due
to its small size. Methods based on hidden Markov models
(HMM) require thousands of parameters for training, which
makes them less suitable for analyzing smaller genomes. The
problem is compounded in the case of SARS-CoV
genomes, which are about 30 kb in length. Even ZCURVE_CoV
(Chen et al., 2003), the method most suitable for viral gene
prediction to date, needs 33 parameters for training.

GeneDecipher, originally developed for prokaryotic gene
prediction, needs only five parameters and can therefore
analyze smaller genomes too. We have trained the artificial
neural network (ANN) on the coding and non-coding regions
[open reading frames (ORFs) not reported as a gene] of the
E. coli K-12 genome. No additional training is required to
predict protein-coding genes using GeneDecipher on viral
genomes. This is an obvious advantage of this method over
other methods. In addition, it is very difficult to find a
negative training set (non-coding regions) for small genomes
like those of coronaviruses. Such sequences have instead been
generated by shuffling the coding sequences (Chen et al., 2003).
Obviating the need to train specifically for the organism thus
makes GeneDecipher suitable for such small genomes.

Next, we tried to assign functions to the GeneDecipher-predicted
SARS-CoV genes using the peptide library based
homology search tool (PLHOST), a tool for functional
prediction developed at our laboratory. PLHOST assigns
function based upon the presence of invariant octa/hepta
peptides across proteins from different species. In this paper,
we present the results of our analysis on 18 SARS-CoV
genomes.


METHODS 
SARS-CoV genome sequence 


Sequences of the 18 SARS-CoV strains available in
the GenBank database (http://www.ncbi.nlm.nih.gov/Entrez/
genomes/viruses) were downloaded and analyzed. These
include SARS-CoV Refseq (NC_004718.3), SARS-CoV TWC
(AY32118), SIN2774 (AY283798), SIN2748 (AY283797),
SIN2679 (AY283796), SIN2677 (AY283794), SIN2500
(AY283794), Frankfurt1 (AY291315), BJ04 (AY279354),
BJ03 (AY278490), BJ02 (AY278487), GZ01 (AY278848),
CUHK-W1 (AY278554), TOR2 (AY274119), TW1
(AY291451), BJ01 (AY278488), Urbani (AY278741) and
HKU-39849 (AY278491). Other information related to
protein-coding genes was retrieved from
http://www.ncbi.nlm.nih.gov/genomes/SARS/SARS.html


GeneDecipher: Protein-coding gene prediction 
software 


Originally, GeneDecipher was developed for prokaryotic gene 
prediction. To execute GeneDecipher on viral genomes we 
prepared a heptapeptide library derived from the proteins of 
56 completely sequenced prokaryotic genomes and 1096 viral 
genomes. 

Development of GeneDecipher is based upon the observation
that the gap between the number of theoretically possible
peptides of a given length and the number actually observed in
nature grows drastically as the peptide length increases.
Moreover, most of the peptides selected by nature are found
only in coding regions and very rarely in theoretically
translated non-coding regions. This observation prompted us to
exploit the exclusivity of naturally selected peptides in
protein-coding sequences to differentiate between coding and
non-coding regions.
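The scale of the observation above can be illustrated with simple arithmetic. The sketch below is not part of GeneDecipher; it only counts the theoretically possible peptides of length k over the 20 standard amino acids and compares that number with the most k-mers a sequence collection of a given size could contain. The function names and the one-million-residue collection size are illustrative assumptions.

```python
def possible_peptides(k: int) -> int:
    """Number of distinct peptides of length k over 20 amino acids."""
    return 20 ** k


def max_observable(total_residues: int, k: int) -> int:
    """Upper bound on the k-mers contained in total_residues residues
    (reached only if the residues form one sequence with no repeats)."""
    return max(total_residues - k + 1, 0)


def coverage_upper_bound(total_residues: int, k: int) -> float:
    """Largest fraction of the peptide space such a collection can cover."""
    return min(1.0, max_observable(total_residues, k) / possible_peptides(k))


if __name__ == "__main__":
    # Hypothetical collection of one million residues (assumption).
    for k in range(3, 9):
        bound = coverage_upper_bound(1_000_000, k)
        print(k, possible_peptides(k), f"{bound:.2e}")
```

For heptapeptides the peptide space already holds 1.28 billion members, so even a large protein collection can cover only a small fraction of it, which is consistent with the exclusivity argument made above.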

Prediction of a given ORF as a coding region/gene is
based upon the number of heptapeptides present and the
distribution of these heptapeptides along the ORF. The
output for a given ORF is a probability value
(the probability of this ORF being a gene). The final cut-off
probability is chosen by the user, but it is constant for a given
genome in all six reading frames (default cut-off is 0.5).

It is worth noting that our method does not rely on any
other evidence, such as ribosome binding site signals, although
such constraints are used by various existing methods. We
omitted them in order to test the strength of the hypothesis.

The method can be divided into five major steps (Fig. 1):

(1) Generation of a peptide library.
(2) Artificial translation of a given genome into six reading
frames.
(3) Conversion of each translated sequence into an integer-coded
sequence (one corresponding to each reading frame).
(4) Training of ANN.
(5) Deciphering genes using trained ANN.
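Steps (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GeneDecipher implementation: the peptide library is modelled as a plain dictionary mapping a heptapeptide to the number of organisms in which it occurs, occurrence values are capped at 9, any window containing a stop codon is coded as '*', and the separate marking of start codons is omitted. The function names and the toy library are hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna: str) -> list:
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


def integer_code(protein: str, library: dict) -> str:
    """One symbol per overlapping heptapeptide: its occurrence value in
    the library (capped at 9), or '*' if the window spans a stop codon."""
    symbols = []
    for i in range(len(protein) - 6):
        window = protein[i:i + 7]
        if "*" in window:
            symbols.append("*")
        else:
            symbols.append(str(min(library.get(window, 0), 9)))
    return "".join(symbols)


if __name__ == "__main__":
    toy_library = {"MKTAYIA": 12, "KTAYIAK": 3}  # hypothetical values
    print(six_frames("ATGAAATAG"))
    print(integer_code("MKTAYIAKQ*", toy_library))
```

The integer-coded strings produced this way are the kind of input from which ORF-level features (number and distribution of heptapeptide hits) would be derived before being passed to the trained ANN.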


PLHOST: Function assignment tool 


We used PLHOST to identify invariant peptides in completely
sequenced genomes; these peptides serve as functional
signatures (Brahmachari and Dash, 2001).

The algorithm generates organism-specific libraries of
octa/hepta peptides from all proteins of the selected genomes.
Redundant peptides are removed from each library. The
peptide libraries are then compared with each other to find all
octa/hepta peptides present invariantly across a specified
minimum number of genomes. Overlapping octa/hepta peptides
are backstitched to generate longer conserved peptides. Because
these occur in functionally similar proteins, they are called
functional signatures.
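The library comparison and backstitching described above can be sketched as follows. This is an illustrative reading of the text, not the PLHOST code: proteomes are given as a dictionary from organism name to a list of protein sequences, a peptide counts once per organism, and backstitching merges overlapping hits found in one protein. All names and the toy sequences are assumptions.

```python
from collections import Counter


def peptide_set(proteins, k=8):
    """Non-redundant set of k-mers from all proteins of one organism."""
    return {p[i:i + k] for p in proteins for i in range(len(p) - k + 1)}


def invariant_peptides(proteomes, k=8, min_genomes=2):
    """Peptides present in at least min_genomes organism libraries."""
    counts = Counter()
    for proteins in proteomes.values():
        counts.update(peptide_set(proteins, k))  # one count per organism
    return {pep for pep, n in counts.items() if n >= min_genomes}


def backstitch(protein, peptides, k=8):
    """Merge overlapping invariant k-mers of a protein into longer
    conserved segments."""
    segments = []
    start = end = None
    for i in range(len(protein) - k + 1):
        if protein[i:i + k] not in peptides:
            continue
        if start is not None and i < end:  # overlaps the open segment
            end = i + k
        else:
            if start is not None:
                segments.append(protein[start:end])
            start, end = i, i + k
    if start is not None:
        segments.append(protein[start:end])
    return segments


if __name__ == "__main__":
    # Toy proteomes (hypothetical sequences, for illustration only).
    proteomes = {
        "orgA": ["MLQGPPGTGKTA"],
        "orgB": ["AALQGPPGTGKSS"],
        "orgC": ["MMMMMMMMMM"],
    }
    shared = invariant_peptides(proteomes, k=8, min_genomes=2)
    print(backstitch("MLQGPPGTGKTA", shared))
```

In the toy example two overlapping octapeptides shared by two organisms are stitched into a single nine-residue segment, which is how a signature longer than eight residues can arise.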


RESULTS AND DISCUSSION 


A systematic sensitivity and specificity analysis of Gene- 
Decipher has been done on 10 microbial genomes (Fig. 2). 
Further analysis of GeneDecipher on viral genomes is 
presented here. 


Testing of GeneDecipher on viral genomes 


To test our method on viral genomes, we first analyzed the
complete genome of human respiratory syncytial virus (HRSV)
using GeneDecipher and compared the results with those of
the state-of-the-art method ZCURVE_CoV
(Table 1). ZCURVE_CoV is able to predict
8 of the 11 annotated proteins reported at NCBI without
any false positives. ZCURVE_CoV was unable to predict
[Figure 1, flow diagram. Recoverable panel text: (i) Nucleotide
string. (ii) Translate in all six frames. (iii) Search each
overlapping heptapeptide in the library and report the occurrence
profile; peptides starting with 'm' are replaced by 's' and those
containing 'z' are replaced by '*', giving six integer-coded
strings. Each integer represents the number of organisms in which
the heptapeptide is present in the library; an occurrence value of
more than 9 is treated as 9. Peptide library format: heptapeptide,
occurrence value. (iv) Split the integer strings into fragments
with start ('s') coded by ATG, GTG, TTG and stop codon ('*') coded
by TAA, TAG and TGA; seven consecutive '*' in the integer-coded
sequence denote the end of a gene. This yields all possible coding
regions (ORFs). (v) ANN trained on the E. coli K-12 genome.
(vi) Predicted protein-coding regions.]

Fig. 1. GeneDecipher flow diagram.

Fig. 2. Sensitivity and specificity of GeneDecipher.
Table 1. Comparison of GeneDecipher results with ZCURVE_CoV results
on HRSV genome, with respect to annotated genes

Annotated genes         ZCURVE_CoV              GeneDecipher
Start   End    Length   Start   End    Length   Start   End    Length
   99    518    139        99    518    139        99    518    139
  626   1000    124         -      -      -       626   1000    124
 1140   2315    391      1140   2315    391      1140   2315    391
 2348   3073    241      2348   3073    241      2348   3073    241
 3263   4033    256      3158   4033    291      3158   4033    291
 4303   4500     65      4303   4500     65      4303   4500     65
 4690   5589    299         -      -      -      4690   5589    299
 5666   7390    574      5666   7390    574      5621   7390    589
 7618   8205    195      7618   8205    195      7618   8205    195
 8171   8443     90         -      -      -         -      -      -
 8509  15009   2166      8443  15009   2188      8443  15009   2188


Table 2. Protein-coding genes predicted by GeneDecipher in SARS-CoV
Refseq common to all 18 strains

S.no.  Start   Stop    Frame  Length  Length  Feature
                              (bp)    (aa)
 1       265   13413   1+     13149   4382    Sars1a polyprotein
 2       701    1225   2+       525    174    Sars174 (new prediction)
 3      1397    1603   2+       207     68    Sars68 (new prediction)
 4      8828    9013   2+       186     61    Sars61 (new prediction)
 5     13599   21485   3+      7887   2628    Sars2628 (C-terminal end of polyprotein 1ab)
 6     21492   25259   3+      3768   1255    Spike (S) protein
 7     25268   26092   2+       825    274    Sars274 (Sars3a)
 8     26117   26347   2+       231     76    Sars76 (Sars4)
 9     26398   27063   1+       666    221    Sars221 (Sars5)
10     27273   27641   3+       369    122    Sars122 (Sars7a)
11     28120   29388   1+      1269    422    Sars422 (Sars9a)
12     28559   28795   2+       237     78    Sars78 (identical to ORF14/Sars9c in TOR2 with shifted start)

the following three genes: PID 9629200 (location 626..1000, 
non-structural protein 2 (NS2)); PID 9629205 (location 
4690..5589, attachment glycoprotein (G)) and PID 9629208 
(location 8171..8443, matrix protein 2 (M2)). GeneDecipher 
predicted 10 out of total 11 annotated proteins of HRSV 
without any false positives. The gene missed by GeneDecipher 
was PID 9629208 (location 8171..8443, matrix protein 2), 
which was notably missed by ZCURVE_CoV too. 
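The counts quoted above can be reproduced from Table 1 with a short script. This is a sketch under one explicit assumption: a prediction is matched to an annotated gene when both share the same end coordinate, so that predictions with a shifted start (for example 3158..4033 against 3263..4033) still count as hits. The matching rule and all names are illustrative; the source does not state how matches were scored.

```python
# End coordinates taken from Table 1 (HRSV genome).
ANNOTATED_ENDS = [518, 1000, 2315, 3073, 4033, 4500, 5589, 7390, 8205, 8443, 15009]
ZCURVE_ENDS = [518, 2315, 3073, 4033, 4500, 7390, 8205, 15009]
GENEDECIPHER_ENDS = [518, 1000, 2315, 3073, 4033, 4500, 5589, 7390, 8205, 15009]


def evaluate(annotated, predicted):
    """Compare predictions with annotations by end coordinate (assumption)."""
    annotated, predicted = set(annotated), set(predicted)
    tp = len(annotated & predicted)   # annotated genes that were predicted
    fn = len(annotated - predicted)   # annotated genes that were missed
    fp = len(predicted - annotated)   # predictions without annotation
    return {
        "tp": tp,
        "fn": fn,
        "fp": fp,
        "sensitivity": tp / (tp + fn),
        # TP / (TP + FP); often reported as specificity in gene finding.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


if __name__ == "__main__":
    for name, ends in (("ZCURVE_CoV", ZCURVE_ENDS),
                       ("GeneDecipher", GENEDECIPHER_ENDS)):
        print(name, evaluate(ANNOTATED_ENDS, ends))
```

Under this matching rule ZCURVE_CoV recovers 8 of 11 annotated genes and GeneDecipher 10 of 11, with no unannotated predictions for either method on this genome.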

This successful prediction of protein-coding regions in the
HRSV genome increases our confidence in predicting
protein-coding regions in newly sequenced SARS-CoV genomes.


Analysis of SARS-CoV using GeneDecipher 


We analyzed all 18 strains of SARS-CoV using GeneDecipher
(detailed results are available on the website given above).
GeneDecipher predicts a total of 15 protein-coding regions
in SARS-CoV genomes. For each of the 18 strains, these include
both polyproteins 1a and 1ab (Sars2628, the C-terminal end of
polyprotein 1ab) and all four known structural proteins (M, N,
S and E). GeneDecipher also predicts six to eight additional
coding regions, depending on the genome sequence of
the strain used. The length of these additional coding regions
varies between 61 and 274 amino acids.

GeneDecipher predicts 12 coding regions that are common
to all 18 strains (Table 2), and one coding region (Sars63;
Sars6 in the NCBI refseq genome) present in five strains.
GeneDecipher also predicts gene Sars90 specifically in the GZ01
strain and Sars154 (Sars3b in the NCBI refseq genome)
specifically in the BJ02 strain.

These 12 common protein-coding regions consist of the
six basic proteins of SARS-CoV (two polyproteins and
the four structural proteins); Sars274 (Sars3a in the NCBI
refseq database), Sars122 (Sars7a in the NCBI refseq database)
and Sars78 (already reported, with a shifted start, as
ORF14/Sars9c in the TOR2 strain); and three newly predicted
protein-coding regions (false positives with respect to the
current annotation at NCBI), Sars174, Sars68 and Sars61. The
three newly predicted genes lie completely within the
polyprotein 1a genomic region. Although our method discards
such genes in bacterial genomes, the possibility of finding such
genes in viral genomes has not been ruled out. As these genes
are present in all 18 strains, it is likely that they are
protein-coding genes.

We predict three more coding regions, Sars63, Sars154 and
Sars90, apart from the 12 discussed above. Sars63, already
reported in NCBI refseq as Sars6, is identified in five strains
and not in the remaining 13, so we cannot comment much on
its existence. The inconsistency is due to a high density of
non-synonymous mutations across strains in this region. Sars154
(Sars3b at NCBI) and Sars90 (newly predicted in the GZ01
strain) are each identified in only one strain and are therefore
less likely to be protein-coding regions, as also suggested by
ZCURVE_CoV (Chen et al., 2003) analysis. The locations of these
three genes in different strains are provided in Table 3.

Since the peptide libraries are made from the genome
sequences of various organisms, the evolutionary origin of
a given protein can be traced. If a protein is rich in
heptapeptides that occur in viral genomes, it is considered to
be of viral origin. We found that five core proteins (two
polyproteins and three structural proteins M, N and S) are of
viral origin. The remaining proteins, including the three new
predictions, are of prokaryotic origin. Interestingly, the same
DNA region yields proteins in different frames that contain
peptides of different origins. How the same DNA sequence can
code for proteins of both bacterial and viral origin is
intriguing. This might explain why these new
Table 3. Identification of Sars90, Sars63, Sars154 as protein-coding genes
by GeneDecipher in various strains of SARS-CoV

S.no.  Strain name  Sars90             Sars63            Sars154
                    (new prediction)   (Sars6 at NCBI)   (Sars3b at NCBI)
 1     SIN2748      -                  -                 -
 2     BJ01         -                  27055..27246      -
 3     BJ02         -                  27074..27265      25689..26153
 4     BJ03         -                  27070..27261      -
 5     BJ04         -                  27058..27249      -
 6     Frankfurt1   -                  -                 -
 7     Urbani       -                  -                 -
 8     GZ01         24492..24764       27058..27249      -
 9     SIN2500      -                  -                 -
10     SIN2677      -                  -                 -
11     SIN2679      -                  -                 -
12     SIN2774      -                  -                 -
13     CUHK-W1      -                  -                 -
14     TW1          -                  -                 -
15     TWC          -                  -                 -
16     HKU-39849    -                  -                 -
17     Refseq       -                  -                 -
18     TOR2         -                  -                 -


protein-coding genes were not detected in primary attempts 
based on homology to other known viral genome sequences. 
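The origin assignment described above can be pictured with a small counting sketch. The source only states that a protein rich in heptapeptides found in viral genomes is considered viral, so the simple majority rule used here, the two-library setup and all names are assumptions made for illustration, not the authors' procedure.

```python
def heptapeptides(protein, k=7):
    """All overlapping k-mers of a protein sequence."""
    return [protein[i:i + k] for i in range(len(protein) - k + 1)]


def origin_votes(protein, libraries):
    """Count, per library, how many heptapeptides of the protein it holds."""
    peptides = heptapeptides(protein)
    return {name: sum(pep in lib for pep in peptides)
            for name, lib in libraries.items()}


def likely_origin(protein, libraries):
    """Library with the most hits; 'unassigned' or 'ambiguous' otherwise."""
    votes = origin_votes(protein, libraries)
    best = max(votes.values())
    if best == 0:
        return "unassigned"
    winners = [name for name, v in votes.items() if v == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


if __name__ == "__main__":
    # Toy libraries with hypothetical heptapeptides.
    libraries = {
        "viral": {"MKTAYIA", "KTAYIAK"},
        "prokaryotic": {"AYIAKQR"},
    }
    print(origin_votes("MKTAYIAKQR", libraries))
    print(likely_origin("MKTAYIAKQR", libraries))
```

Because overlapping reading frames translate to entirely different heptapeptides, two proteins read from the same DNA region can receive different labels under any rule of this kind.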


Comparison with the existing 
system—ZCURVE_CoV 


Comparisons of the GeneDecipher and ZCURVE_CoV results with
the known annotations for the Urbani and TOR2 strains of
SARS-CoV are presented in Tables 4 and 5.

In general, GeneDecipher results are in good agreement
with the known annotations. In the Urbani strain,
GeneDecipher predicts all the known genes except Sars84 (X5),
Sars63 (X3) and Sars154 (X2). Sars84 (X5) and Sars63 (X3)
are supported by ZCURVE_CoV, whereas Sars154 (X2) is
missed by both methods. GeneDecipher predicts four
new genes in this strain, none of which is supported by
ZCURVE_CoV. Notably, one of these four genes, Sars78, is
already known in strain TOR2 as ORF14/Sars9c, which
supports the likelihood of the gene being present in the
Urbani strain. ZCURVE_CoV, in turn, predicts two new genes
that are not supported by GeneDecipher.

GeneDecipher predictions for the TOR2 strain are identical
to those for the Urbani strain. In this strain, GeneDecipher
predicts nine known genes but fails to predict six genes with
known annotations: Sars154 (ORF4), Sars98 (ORF13), Sars63
(ORF7), Sars44 (ORF9), Sars39 (ORF10) and Sars84 (ORF11). Of
these, Sars154 (ORF4) and Sars98 (ORF13) are also missed by
ZCURVE_CoV. Both Sars44 (ORF9) and Sars39 (ORF10) are very
short ORFs (44 and 39 amino acids, respectively), and their
presence is not consistent across the various SARS strains.
Sars63 (ORF7) has been predicted by GeneDecipher in five other
strains but not in the two strains considered here.


Mutation analysis 


Analysis using multiple sequence alignment (ClustalW) for the
three newly predicted protein-coding genes Sars174, Sars68
and Sars61 across all 18 strains shows the following:

1. Sars68 has one point mutation at location 80, GAT → GGT
(D → G), in the SIN2677 strain.

2. Sars174 has two synonymous point mutations, at location
204 (CGA → CGC) in the GZ01 strain and at location 447
(CTG → CTT) in the BJ04 strain.

3. Sars61 has one point mutation at location 119, CTG →
CAG (L → Q), in the GZ01 strain.


These three newly predicted genes are present in all 18 strains
without significant mutations and have no significant BLASTP
hits in the non-redundant database. This indicates that these
three proteins might have crucial biological functions specific
to SARS-CoV. Therefore, these coding sequences might serve
as candidate drug targets against SARS.
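The codon changes listed in the mutation analysis can be checked against the standard genetic code. The sketch below classifies each reported change as synonymous or non-synonymous and, assuming the reported locations are 1-based nucleotide positions counted from the start of each gene, shows which codon and which base within the codon is affected. Function names are illustrative.

```python
# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify(ref_codon, alt_codon):
    """Return (kind, reference amino acid, altered amino acid)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


def codon_position(location):
    """Map a 1-based nucleotide location within a gene (assumption) to
    (codon number, base position within that codon), both 1-based."""
    return (location - 1) // 3 + 1, (location - 1) % 3 + 1


if __name__ == "__main__":
    reported = [
        ("Sars68", 80, "GAT", "GGT"),
        ("Sars174", 204, "CGA", "CGC"),
        ("Sars174", 447, "CTG", "CTT"),
        ("Sars61", 119, "CTG", "CAG"),
    ]
    for gene, location, ref, alt in reported:
        print(gene, location, codon_position(location), classify(ref, alt))
```

Both Sars174 changes fall on the third base of their codons and leave the amino acid unchanged, whereas the Sars68 and Sars61 changes fall on the second base and alter the residue, in agreement with the list above.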


Function assignment 


In total, we predict 15 coding regions in SARS-CoV, of
which the four structural proteins (M, N, S and E) already
have assigned functions. Although polyprotein 1ab has
been assigned only replicase activity, our analysis implies that
the replicase activity is associated with the Sars2628
(C-terminal of ORF 1ab) fragment. The complete 1ab polyprotein
contains six functional signatures, of which polyprotein 1a
contains the signatures associated with metabolic enzymes
(Table 6). Functions were assigned to the polyproteins on the
basis of peptides (seven or more amino acids long) occurring in
proteins having similar functions in at least five different
organisms. Other predicted genes/protein-coding regions contain
peptides that occur in fewer genomes. Based on these peptides
we suggest functions, albeit with lesser confidence (Table 7).
The biological relevance of these findings remains to be
explored.


CONCLUSION 


In this paper, we have predicted four new genes in SARS-CoV,
including Sars78 (already known in the TOR2 strain).
Our analysis also corroborates the finding of ZCURVE_CoV
(Chen et al., 2003) that ORF Sars154 (listed in Refseq as
Sars3b) is unlikely to be a coding region. We have also
assigned functions to the two polyproteins 1ab and 1a. In
addition to the replication-associated function of the
C-terminal of the 1ab polyprotein, our analysis implies that
polyprotein 1a may be associated with metabolic enzyme-like
functions. In all, six peptide signatures are present in
polyprotein 1ab. We have
Table 4. Comparison of GeneDecipher results with ZCURVE_CoV results on SARS-CoV genome Urbani strain, with respect to annotated genes

Annotated genes         ZCURVE_CoV              GeneDecipher            Features
Start   End    Length   Start   End    Length   Start   End    Length
  265  13398   4377       265  13398   4377       265  13413   4382     ORF 1a
    -      -      -         -      -      -       701   1225    174     Sars174 (new prediction by GeneDecipher)
    -      -      -         -      -      -      1397   1603     68     Sars68 (new prediction by GeneDecipher)
    -      -      -         -      -      -      8828   9013     61     Sars61 (new prediction by GeneDecipher)
13398  21485   2695     13398  21485   2695     13599  21485   2628     ORF 1b
21492  25259   1255     21492  25259   1255     21492  25259   1255     S protein
25268  26092    274     25268  26092    274     25268  26092    274     Sars274 (X1)
25689  26153    154         -      -      -         -      -      -     Sars154 (X2)
26117  26347     76     26117  26347     76     26117  26347     76     E protein
26398  27063    221     26398  27063    221     26389  27063    224     M protein
27074  27265     63     27074  27265     63         -      -      -     Sars63 (X3)
27273  27641    122     27273  27641    122     27273  27641    122     Sars122 (X4)
    -      -      -     27638  27772     44         -      -      -     Sars44
    -      -      -     27779  27898     39         -      -      -     Sars39
27864  28118     84     27864  28118     84         -      -      -     Sars84 (X5)
28120  29388    422     28120  29388    422     28120  29388    422     N protein
    -      -      -         -      -      -     28559  28795     78     Sars78 (identical to ORF14/Sars9c in TOR2 with shifted start)

Table 5. Comparison of GeneDecipher results with ZCURVE_CoV results on SARS-CoV genome TOR2 strain, with respect to annotated genes

Annotated genes         ZCURVE_CoV              GeneDecipher            Features
                        predicted genes         predicted genes
Start   End    Length   Start   End    Length   Start   End    Length
  265  13398   4377       265  13398   4377       265  13413   4382     ORF 1a
    -      -      -         -      -      -       701   1225    174     Sars174 (new prediction by GeneDecipher)
    -      -      -         -      -      -      1397   1603     68     Sars68 (new prediction by GeneDecipher)
    -      -      -         -      -      -      8828   9013     61     Sars61 (new prediction by GeneDecipher)
13398  21485   2695     13398  21485   2695     13599  21485   2628     ORF 1b
21492  25259   1255     21492  25259   1255     21492  25259   1255     S protein
25268  26092    274     25268  26092    274     25268  26092    274     ORF3 (Sars274)
25689  26153    154         -      -      -         -      -      -     ORF4 (Sars154)
26117  26347     76     26117  26347     76     26117  26347     76     E protein
26398  27063    221     26398  27063    221     26389  27063    224     M protein
27074  27265     63     27074  27265     63         -      -      -     Sars63 (ORF7)
27273  27641    122     27273  27641    122     27273  27641    122     Sars122 (ORF8)
27638  27772     44     27638  27772     44         -      -      -     Sars44 (ORF9)
27779  27898     39     27779  27898     39         -      -      -     Sars39 (ORF10)
27864  28118     84     27864  28118     84         -      -      -     Sars84 (ORF11)
28120  29388    422     28120  29388    422     28120  29388    422     N protein
28130  28426     98         -      -      -         -      -      -     ORF13
28583  28795     70         -      -      -     28559  28795     78     Sars78 (identical to ORF14/Sars9c in TOR2 with shifted start)
Table 6. Functional assignment of polyproteins in SARS (Urbani) genome using PLHOST

S.no.  NCBI annotation                     Conserved peptide  Function assigned
                                           signature
1      Sars 1ab (polyprotein 1ab)          RIRASLPT           Phosphoglycerate kinase
                                           RSETLLPL           Sulfite reductase (NADPH), flavoprotein beta subunit
                                           LDKLKSLL           Probable acyl-CoA thiolase
                                           ATVVIGTS           Cell division protein FtsZ
                                           NVAITRAK           DNA-binding protein, probably DNA helicase
                                           LQGPPGTGK          DNA helicase related protein

2      Sars 1a (polyprotein 1a)            RIRASLPT           Phosphoglycerate kinase
                                           RSETLLPL           Sulfite reductase (NADPH), flavoprotein beta subunit
                                           LDKLKSLL           Probable acyl-CoA thiolase

3      Sars2628 (C-terminal of Sars 1ab)   ATVVIGTS           Cell division protein FtsZ
                                           NVAITRAK           DNA-binding protein, probably DNA helicase
                                           LQGPPGTGK          DNA helicase related protein

Table 7. Suggested functions for some of the non-structural genes in
SARS-CoV using PLHOST

S.no.  Gene                       Peptide    Suggested function
                                  signature
1      Sars174 (new prediction)   TLSKGNAQ   ABC transporter ATP binding protein (Lactococcus lactis subsp. lactis)
                                  VAQMGTLL   Cytochrome c oxidase folding protein (Synechocystis sp. PCC 6803)
2      Sars68 (new prediction)    LVLVLILA   Putative major facilitator superfamily protein (Schizosaccharomyces pombe)
                                  TQTLKLDS   Serine/threonine kinase 2; Serine/threonine protein kinase-2 (Homo sapiens)
3*     Sars90 (new prediction     GLLHRGT    NADH dehydrogenase I chain
       only in GZ01 strain)
4      Sars61 (new prediction)    LLPLLAFL   Putative protein (conserved across two organisms)
5      Sars274 (Sars3a)           LLLFVTIY   Polyamine transport protein; Tpo1p (Saccharomyces cerevisiae)
6      Sars154 (Sars3b)           QTLVLKML   K550.3.p (Caenorhabditis elegans)
7      Sars63 (Sars6)             DDEELMEL   Elongation factor Tu (Lactococcus lactis subsp. lactis)
8      Sars122 (Sars7a)           LIVAALVF   Putative transport transmembrane protein (Sinorhizobium meliloti)
                                  RARSVSPK   Src homology domain 3 (C. elegans)
9*     Sars78 (Sars9c)            QLLAAVG    Gamma-glutamate kinase (conserved across 8 organisms)

*No conserved octapeptide was found. However, function has been assigned on the basis
of the only highly conserved heptapeptide.


suggested putative functions for the other nine proteins, including
the ones newly predicted by GeneDecipher.